Abstract

Efficient and accurate joint representation of a collection
of images that belong to the same class is a major research
challenge for practical image set classification. Existing
methods either make prior assumptions about the data
structure, or perform heavy computations to learn structure
from the data itself. In this paper, we propose an efficient
image set representation that does not make any prior
assumptions about the structure of the underlying data. We
learn the non-linear structure of image sets with Deep
Extreme Learning Machines (DELM) that are very efficient
and generalize well even on a limited number of training
samples. Extensive experiments on a broad range of public
datasets for image set classification (Honda/UCSD, CMU
Mobo, YouTube Celebrities, Celebrity-1000, ETH-80) show
that the proposed algorithm consistently outperforms
state-of-the-art image set classification methods both in terms
of speed and accuracy.

1. Introduction 

Image set based classification has attracted significant
interest from the computer vision and pattern recognition
community due to its wide range of applications in multi-view
object classification [14, 28, 26, 27, 5, 24] and face
recognition [3, 7, 22, 23, 21, 6, 20]. Image set classification
naturally arises in many applications when a given collection
of images is known to belong to one class but with
unknown identity. In contrast to the traditional paradigm of
single image based classification, algorithms for image set
classification exploit this information to obtain a more accurate
estimate of the class identity. Multiple images of a set
usually contain a range of intra-class appearance variations
such as pose, illumination and scale changes, which can be
explicitly or implicitly modelled for improved classification
accuracy [7, 14, 6, 2 ]. Image set based classification may
also be considered as a generalization of video based object
classification. However, image set classification does
not assume any temporal relationship between the images
that constitute the set. Thus, image set classification is also
applicable in situations where the set samples have large
variations without any temporal relationship [7, 27].

An image set classification algorithm must essentially
address two core challenges: how to represent an image
set to effectively capture intra-image as well as inter-image
variations, and how to define a distance/similarity measure
between two image sets. Defining a suitable distance between
two image sets is often tied to the representation used
to model the image sets in the first place. Hence, most of
the research in this area has concentrated on developing image
set representations based on certain assumptions about
the set structure. Some techniques assume that the set data
follows a Gaussian distribution [26, 27, 21, 2 ], which is
unlikely to be true for all types of images. Other methods
assume that an image set can be represented by linear subspaces
[14, 5], whereas the data may lie on complex manifolds
[6]. To model more complex data structures, several
techniques have been proposed to model image sets
as convex or affine hulls of the data samples [3, 7, 20].
These techniques are conceptually similar to nearest neighbor
classification and must impose certain constraints to
avoid finding the neighbors in some low dimensional space
where image sets might intersect. However, the ability to
model more complex image set structures comes at the cost
of added algorithm complexity [28, 26, 7, 21, 20, 6]. Therefore,
these algorithms cannot be efficiently scaled to handle
large image set classification tasks [18].

In this work, we have focused on developing an efficient
and accurate representation of image sets that can model
arbitrarily complex image set structures on one hand, and
scale to large problem sizes on the other. We employ Extreme
Learning Machines (ELM) for this purpose primarily
due to their computational efficiency [11, 9, 10, 8, 19]. An
ELM trains a single hidden layer feed-forward neural network
(SLFN) by randomly assigning weights to the input
layer and analytically computing the weights for the output
layer. Deep ELMs have the potential of effectively learning
the underlying structure of the image set without any
prior assumption on the distribution or structure of image
set data. Our algorithm learns a Deep ELM (DELM) model
for each class in the gallery (training classes) through
unsupervised feature learning with an ELM based auto-encoder
(ELM-AE) (Fig. 1). A label is assigned to a probe (test) set
based on minimum reconstruction error.

Figure 1. Illustration of the proposed algorithm. During training, we first learn a domain-specific Deep Extreme Learning Machines
(DELM) model $L_g$. Starting from the domain-specific model, we then learn class-specific DELM models $L_j$ for the gallery sets of each
class separately. Given a probe image set $X_t$, we first reconstruct each of its samples using the learned DELM models and then estimate
its label based on the smallest reconstruction error. Finally, majority voting is used to estimate the label of the probe set as a whole.

The key contributions of this paper are three-fold: (1)
An effective image set representation scheme based on
Deep Extreme Learning Machines that does not make any
assumption about the structure of the set but implicitly
learns it from training data. (2) Unlike existing deep
learning based methods, our algorithm does not require a
large amount of training data. (3) The proposed network
is extremely fast in both training and testing: training is
6,000 times faster than the state-of-the-art best-performing
method, whereas testing is 9 times faster. The proposed
algorithm is extensively evaluated for image set based face
recognition and object categorization on five benchmark
datasets including Honda/UCSD [16], CMU Mobo [4],
YouTube Celebrities [15], Celebrity-1000 [18] and ETH-80
[17]. Results demonstrate that our algorithm consistently
outperforms existing methods in terms of accuracy, while
achieving substantial speedups at the same time.

2. Related Work 

Existing image set classification techniques can be categorized
into sample based and structure based set-to-set
matching methods. Sample based techniques compute
the distance between the nearest neighbors of two image
sets under certain constraints. For example, Cevikalp and
Triggs [3] model each image set as a convex geometric
region in feature space. Set dissimilarity between the regions
represented by the affine (AHISD) or convex hulls
(CHISD) is measured by the distance of closest point approach.
For the affine hull model, the distance is minimized
using least squares while for the convex hull model,
an SVM is trained to separate the two sets. Hu et al. [7]
approximate each of the two nearest points between two image
sets by a sparse combination of the corresponding set
samples. The sparse approximated nearest points (SANP)
lie close to some facets of the affine hulls and hence
implicitly incorporate structural information while matching
two sets. To find more accurate nearest points between two
image sets, Mian et al. [23] introduced self-regularized
non-negative coding to define the between-set distance. They
constrained the orthogonal basis vectors to be similar to the
approximated nearest points and added the non-negativity
constraint on the set samples while approximating nearest
points. Mahmood et al. [22] performed spectral clustering
on the combined gallery and test samples. The class-cluster
distributions of the set samples were then used for
classification. Lu et al. [20] jointly learn a structured dictionary
and projection matrix to map set samples into a
low-dimensional subspace. The low dimensional samples
are then represented using sparse codes and classification is
performed based on the traditional minimum reconstruction
error and majority voting scheme. In general, sample based
methods are highly prone to outliers and are computationally
expensive for large galleries.

Structure based techniques model an image set with
one or more linear subspaces. Structural similarity is then
measured using a subspace-to-subspace distance. Kim et
al. [14] perform discriminant analysis on the canonical correlations
calculated between set structures. Wang et al. [28]
model an image set with multiple local clusters and represent
each cluster with a linear subspace. Subspace distance
between the nearest local clusters of two sets is then
used for classification. Chen et al. [24] proposed sparse
approximated nearest subspaces (SANS) to extract local linear
models from the gallery image sets via sparse representation.
By forcing the clusters of the query image set
to resemble clusters in the gallery image sets, only
corresponding clusters are matched using the subspace based
distance. Wang and Chen [26] proposed Manifold Discriminant
Analysis (MDA), which models each image set
using multiple local linear clusters. These clusters are transformed
by a linear discriminant operator to separate different
classes. Here, the set-to-set similarity is measured
using pair-wise local cluster distances in the learned embedding
space. Harandi et al. [5] modeled the image set
structure with linear subspaces as points lying on Grassmannian
manifolds. They define kernels to map points from
the Grassmannian manifold to Euclidean space, where
classification is performed by graph embedding discriminant
analysis. Wang et al. [27] model the structure of each image
set directly using a covariance matrix. They map the
covariance matrix of each image set from the Riemannian
manifold to the Euclidean space by a kernel function based
on the Log-Euclidean distance. Image sets are then classified
according to a learned regression function using Kernel
Partial Least Squares. Hayat et al. [6] learn the structure of
each gallery image set using a deep learning model. The
label of the test set is then estimated based on the minimum
reconstruction error and majority voting scheme. Generally,
structure based algorithms require a relatively large number
of images in each set (dense sampling) in order to accurately
model the underlying structure.

We propose a structure based image set classification
algorithm that neither makes prior assumptions about the set
structure nor incurs a heavy computational burden to learn
the structure from the data. The proposed representation is
based on deep Extreme Learning Machines and automatically
learns the non-linear structure of image sets. The
proposed algorithm is extremely efficient to train and
generalizes very well even with a limited number of training
samples.

3. Proposed Methodology 

We first give a brief overview of Extreme Learning Machines
(ELMs) and how they differ from other learning
paradigms. Then, we discuss how to extend the traditional
ELM idea to multiple layers, thus allowing a deeper
representation. Finally, we show how image set classification can
be formulated using the Deep ELM (DELM) models and
how it can benefit from ELM's attractive properties, namely
very efficient learning (easily scalable to large datasets) and
generalizability (no prior assumptions on the set data).

3.1. Extreme Learning Machines 

Consider a supervised learning problem with N training
samples, $\{X, T\} = \{x_j, t_j\}_{j=1}^{N}$, where $x_j \in \mathbb{R}^d$ and
$t_j \in \mathbb{R}^q$ are the j-th input and target samples respectively;
d and q are the input and target feature dimensions respectively.
For the task of classification, $t_j$ is the class label
vector while for regression $t_j$ represents the desired output
feature. In either case, we seek a regressor function from the
inputs to the targets. One popular form of this function is the
standard single hidden layer feed-forward network (SLFN),
where $n_h$ hidden nodes fully connect the d inputs to the q
outputs. This is done through an activation function g(u).
The predicted output vector $o_j$ generated by feeding forward
$x_j$ through an SLFN is mathematically modelled as

$$o_j = \sum_{i=1}^{n_h} \beta_i \, g(w_i^T x_j + b_i) \tag{1}$$

where $w_i \in \mathbb{R}^d$ is the weight vector connecting the i-th hidden
node and the input nodes, $\beta_i \in \mathbb{R}^q$ is the weight vector
connecting the i-th hidden node and the output nodes, and $b_i$
is the bias of the i-th hidden node. The activation function
g(u) can be any non-linear piecewise continuous function,
such as the sigmoid function $g(u) = \frac{1}{1 + e^{-u}}$.

An ELM learns the parameters of an SLFN (i.e.
$\{w_i, b_i, \beta_i\}_{i=1}^{n_h}$) in two sequential stages: random feature
mapping and linear parameter solving. In the first ELM
stage, the hidden layer parameters ($\{w_i, b_i\}_{i=1}^{n_h}$) are randomly
initialized to project the input data into a random
ELM feature space using the mapping function g(.). It
is this random projection stage that differentiates ELM from
most existing learning paradigms, which perform deterministic
feature mapping. For example, an SVM uses kernel
functions, while deep neural networks [ ] use Restricted
Boltzmann machines (RBM) for feature mapping/learning.
By randomizing the feature mapping stage, the ELM can
discover non-linear structures in the data without the need
for priors, which are inherent to deterministic
feature mapping schemes. Also, these parameters are set
randomly and are not subsequently updated, thus decoupling
them from the output parameters $\{\beta_i\}_{i=1}^{n_h}$, which can
be learned in a very efficient manner as we will see next.
This decoupling strategy significantly speeds up the parameter
learning process in ELM, thus making it much more
computationally attractive than deep neural network architectures
that learn all network parameters iteratively.
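The random feature mapping stage can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: samples are stored as rows, the hidden parameters are drawn from the uniform distribution on [-1, 1], and the names `sigmoid` and `random_feature_map` are hypothetical helpers, not part of any published implementation.

```python
import numpy as np

def sigmoid(u):
    """Logistic activation g(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def random_feature_map(X, n_hidden, rng):
    """Project the rows of X (N x d) into a random ELM feature space.

    The hidden weights W (n_hidden x d) and biases b (n_hidden,) are drawn
    once at random and never updated afterwards (assumed uniform on [-1, 1]).
    """
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, d))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W.T + b)  # row j is the hidden response to sample x_j
    return H, W, b

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))           # N = 5 samples, d = 3
H, W, b = random_feature_map(X, n_hidden=4, rng=rng)
print(H.shape)                            # (5, 4)
```

Because `W` and `b` stay fixed, only the output weights remain to be estimated, which is what makes the second stage a linear problem.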

In the second ELM stage, the parameters connecting the
hidden layer and the output layer (i.e. $\{\beta_i\}_{i=1}^{n_h}$) are learned
efficiently using regularized least squares. Here, we denote
$\psi(x_j) = [g(w_1^T x_j + b_1), \ldots, g(w_{n_h}^T x_j + b_{n_h})] \in \mathbb{R}^{1 \times n_h}$
as the response vector of the hidden layer to the input $x_j$
and $B \in \mathbb{R}^{n_h \times q}$ as the output parameters connecting the
hidden and output layers. An ELM aims to solve for B by
minimizing the sum of the squared losses of the prediction
errors:

$$\min_{B \in \mathbb{R}^{n_h \times q}} \; \frac{1}{2}\|B\|_F^2 + \frac{C}{2}\sum_{j=1}^{N}\|e_j\|_2^2 \quad \text{s.t.} \quad \psi(x_j) B = t_j^T - e_j^T, \;\; j = 1, \ldots, N \tag{2}$$

In (2), the first term is a regularizer against over-fitting,
$e_j \in \mathbb{R}^q$ is the error vector with respect to the j-th training
pattern (i.e. $e_j = t_j - o_j$), and C is a tradeoff coefficient.
By concatenating $H = [\psi(x_1)^T, \ldots, \psi(x_N)^T]^T \in \mathbb{R}^{N \times n_h}$
and $T = [t_1, \ldots, t_N]^T \in \mathbb{R}^{N \times q}$, we obtain an equivalent
unconstrained optimization problem, which is widely known
as ridge regression or regularized least squares:

$$\min_{B \in \mathbb{R}^{n_h \times q}} \; \frac{1}{2}\|B\|_F^2 + \frac{C}{2}\|T - HB\|_F^2 \tag{3}$$


Since the above problem is convex, its global solution needs
to satisfy the following linear system, obtained by setting the
gradient of (3) with respect to B to zero:

$$B - C H^T (T - HB) = 0. \tag{4}$$

The solution to this system depends on the nature and size
of matrix H. If H has more rows than columns and is of full
column rank (which usually is the case when $N > n_h$), the
system is overdetermined and a closed form solution exists
for (3) in (5), where $I_{n_h} \in \mathbb{R}^{n_h \times n_h}$ is an identity matrix. Note
that in practice, rather than explicitly inverting the $n_h \times n_h$
matrix, we obtain $B^*$ by solving the linear system in a more
efficient and numerically stable manner.

$$B^* = \left(H^T H + \frac{I_{n_h}}{C}\right)^{-1} H^T T \tag{5}$$

If $N < n_h$, H will have more columns than rows, which
often leads to an under-determined least squares problem.
Here, B may have an infinite number of solutions, so
we restrict B to be a linear combination of the rows
of H: $B = H^T \alpha$ ($\alpha \in \mathbb{R}^{N \times q}$). Note that when H has
more columns than rows and is of full row rank, then $H H^T$
is invertible. Multiplying both sides of (4) by $(H H^T)^{-1} H$,
we obtain a closed form solution for $B^*$, where $I_N \in \mathbb{R}^{N \times N}$
is an identity matrix:

$$B^* = H^T \alpha^* = H^T \left(H H^T + \frac{I_N}{C}\right)^{-1} T \tag{6}$$

Figure 2. Layer wise training of a Deep ELM model with h hidden
layers and input X.

To summarize, ELMs have two notably attractive features.
Firstly, the parameters of the hidden mapping function can
be randomly generated according to any continuous probability
distribution, e.g. the uniform distribution on [-1, 1].
Secondly, as such, the only parameters to be learned in
training are the output weights between the hidden and output
nodes. This can be done by solving a single linear system
or even in closed form. These two features make ELMs
more flexible than SVMs and much more computationally
attractive than traditional feed-forward neural networks that
use back-propagation [9].
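As a rough illustration of the second ELM stage, the sketch below computes the output weights from a hidden response matrix H and targets T by solving a regularized linear system rather than forming an explicit inverse. It assumes row-wise samples; the function name `solve_output_weights` and the choice to switch formulas on the shape of H are illustrative assumptions, not the authors' code.

```python
import numpy as np

def solve_output_weights(H, T, C):
    """Regularized least-squares output weights for an ELM (a sketch).

    H: (N, n_h) hidden responses, T: (N, q) targets, C: tradeoff coefficient.
    The smaller of the two possible linear systems is solved in each case.
    """
    N, n_h = H.shape
    if N >= n_h:
        # Overdetermined case: (H^T H + I / C) B = H^T T
        A = H.T @ H + np.eye(n_h) / C
        return np.linalg.solve(A, H.T @ T)
    # Under-determined case: B = H^T alpha with (H H^T + I / C) alpha = T
    A = H @ H.T + np.eye(N) / C
    return H.T @ np.linalg.solve(A, T)

rng = np.random.default_rng(0)
C = 10.0
H_tall, T_tall = rng.random((6, 3)), rng.random((6, 2))   # N > n_h
H_wide, T_wide = rng.random((3, 6)), rng.random((3, 2))   # N < n_h
B_tall = solve_output_weights(H_tall, T_tall, C)
B_wide = solve_output_weights(H_wide, T_wide, C)

# Both solutions make the gradient of the ridge objective vanish.
print(np.allclose(B_tall - C * H_tall.T @ (T_tall - H_tall @ B_tall), 0))
print(np.allclose(B_wide - C * H_wide.T @ (T_wide - H_wide @ B_wide), 0))
```

For a strictly positive C the two formulas give the same ridge solution, so the branch only decides whether an $n_h \times n_h$ or an $N \times N$ system is solved.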

3.2. Learning Representations with ELMs 

Learning rich representations efficiently is crucial for
achieving high generalization performance, especially at
large scales. This form of learning can usually be done using
stacked auto-encoders (SAE) and stacked denoising auto-encoders
(SDA), where a parametric regressor function is learned to
map the input to itself. Although deep neural networks can
be learned for this purpose and have been shown to yield
good performance in various computer vision tasks [1, 2],
they are generally very slow in training. To address the
problem, we learn representations in an unsupervised way
using an ELM based auto-encoder [] ], which in essence is
a multi-layer feed-forward network whose parameters are
learned by cascading multiple layers of ELM. This ELM-based
learning procedure is highly efficient and has good
generalization capabilities.

The deep ELM auto-encoder is designed by setting the
targets of the multi-layer network to the input, i.e. T = X.
Figure 2 shows the process of learning a DELM model from
the samples of the training set X. Here, we consider a fully
connected multi-layer network with h hidden layers. Let
$L = \{W^1, \ldots, W^{h+1}\}$ denote the parameters of the DELM
that need to be learned, where $W^i = [w_1^i, \ldots, w_{n_{i+1}}^i]^T \in
\mathbb{R}^{n_{i+1} \times n_i}$. To simplify training, each layer is decoupled
within the network and processed as an ELM, whose targets
are the same as its inputs. As shown in Figure 2, $W^1$ is
learned by considering a corresponding ELM with T = X.

The weight vectors connecting the input layer to each
unit of the first hidden layer are orthonormal to each other,
effectively resulting in projection of the input data to a random
subspace. Compared to initializing random weights
independent of each other, orthogonalization of these random
weights tends to better preserve pairwise distances in the
random ELM feature space [12] and improves ELM
auto-encoding generalization performance. Next, $B_1$ is calculated
using (5) or (6) depending on the number of nodes in
the hidden layer. Note that $B_1$ re-projects the lower dimensional
representation of the input data back to its original
space while minimizing the reconstruction error. Therefore,
this projection matrix is data-driven and hence used as the
weights of the first layer ($W^1 = B_1^T$). Similarly, $W^2$ is
learned by setting the input and output of Layer 2 to $H_1$, i.e.
the output of Layer 1. In this manner, all parameters of the
DELM can be computed sequentially. However, when the
number of nodes in two consecutive layers is equal,
the random projection obtained in the second layer is in the
same space as the input of the first layer. Using (5) or (6) does
not ensure orthogonality of the computed weight matrix B.
Imposing orthogonality in this case results in a more accurate
solution since the data always lies in the same space.
Therefore, the output weights B are calculated as the solution
to the Orthogonal Procrustes problem.

$$B^* = \arg\min_{B \in \mathbb{R}^{n_h \times q}} \|HB - T\|_F^2 \quad \text{s.t.} \quad B^T B = I \tag{7}$$

The closed form solution is obtained by finding the nearest
orthogonal matrix to the given matrix $M = H^T T$. To find
the orthogonal matrix $B^*$, we use the singular value decomposition
$M = U \Sigma V^T$ to compute $B^* = U V^T$.
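The Orthogonal Procrustes step can be checked numerically with a few lines of NumPy. This is a minimal sketch assuming row-wise samples and a square problem; `procrustes_weights` is a hypothetical helper name.

```python
import numpy as np

def procrustes_weights(H, T):
    """Orthogonal B minimizing ||H B - T||_F, via the SVD of M = H^T T."""
    M = H.T @ T
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
H = rng.standard_normal((20, 4))
# Build targets from a known orthogonal matrix so the optimum is known.
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = procrustes_weights(H, H @ Q_true)

print(np.allclose(B.T @ B, np.eye(4)))  # B is orthogonal
print(np.allclose(B, Q_true))           # and recovers the generating matrix
```

When the targets are an exact orthogonal transform of H and H has full column rank, the recovered matrix coincides with that transform, which gives a simple sanity check for the closed form.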

In ELM-AE, the orthogonal random weights and biases
of the hidden nodes project the input data to a different or
equal dimension space. The DELM models can automatically
learn the non-linear structure of data in a very efficient
manner. In contrast to deep networks, DELM also does not
require expensive iterative fine tuning of the weights.
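The layer-wise procedure of this section can be sketched as follows. Several details are assumptions made only for illustration: samples are rows, the orthonormal random weights are obtained from a QR factorization, the random bias is scaled to unit norm, the regularized system is solved in its $n_h \times n_h$ form for every layer, and the last layer is a plain linear ridge read-out back to the input (no output non-linearity). All function names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def orthonormal_weights(n_in, n_out, rng):
    """Random (n_in x n_out) matrix with orthonormal columns (or rows)."""
    M = rng.standard_normal((max(n_in, n_out), min(n_in, n_out)))
    Q, _ = np.linalg.qr(M)               # Q has orthonormal columns
    return Q if n_in >= n_out else Q.T

def ridge(H, T, C):
    """Regularized least squares: (H^T H + I / C) B = H^T T."""
    return np.linalg.solve(H.T @ H + np.eye(H.shape[1]) / C, H.T @ T)

def elm_ae_layer(X, n_hidden, C, rng):
    """One ELM auto-encoder; returns a weight matrix of shape (n_hidden, n_in)."""
    n_in = X.shape[1]
    A = orthonormal_weights(n_in, n_hidden, rng)
    b = rng.standard_normal(n_hidden)
    b /= np.linalg.norm(b)               # assumed unit-norm random bias
    H = sigmoid(X @ A + b)
    if n_hidden == n_in:
        # Equal dimensions: orthogonal solution (Orthogonal Procrustes).
        U, _, Vt = np.linalg.svd(H.T @ X, full_matrices=False)
        return U @ Vt
    return ridge(H, X, C)                # targets equal the layer input

def train_delm(X, hidden_sizes, C, rng):
    weights, H = [], X
    for n_hidden in hidden_sizes:
        W = elm_ae_layer(H, n_hidden, C, rng)
        weights.append(W)
        H = sigmoid(H @ W.T)             # data-driven representation
    weights.append(ridge(H, X, C).T)     # assumed linear read-out to the input
    return weights

def reconstruct(X, weights):
    H = X
    for W in weights[:-1]:
        H = sigmoid(H @ W.T)
    return H @ weights[-1].T

rng = np.random.default_rng(0)
X = rng.random((50, 8))                  # 50 samples, d = 8
weights = train_delm(X, hidden_sizes=[5, 5], C=1e4, rng=rng)
X_hat = reconstruct(X, weights)
print([W.shape for W in weights])        # [(5, 8), (5, 5), (8, 5)]
```

In this toy configuration the second hidden layer has the same width as the first, so its weights come from the orthogonal solution, while the first layer uses the regularized least-squares solution.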

3.3. Deep ELM models for Image set Classification 

DELM based image set classification is performed in
two steps. We first learn a global domain-specific DELM
model using all the training image data and then build
class-specific DELM models using the global representation as an
initialization. In doing so, we encode both domain level and
class-specific properties of the data.

Let $G = \{X_m\}_{m=1}^{c} \in \mathbb{R}^{d \times N}$ be the gallery containing
c image sets of c different classes and N images, $N =
\sum_{m=1}^{c} s_m$, where $s_m$ is the number of image samples in the
m-th image set. Let $X_m = \{x_m^i\}_{i=1}^{s_m} \in \mathbb{R}^{d \times s_m}$ be the m-th
image set, where $x_m^i \in \mathbb{R}^d$ is a d-dimensional feature vector
obtained by vectorizing the pixels of the i-th image. Instead
of pixel values, the vector $x_m^i$ may also contain other features,
e.g. local binary patterns (LBP). While $s_m$ can vary
across image sets, the dimensionality of $x_m^i$ remains fixed.
Let $Y = \{y_m\}_{m=1}^{c}$ be the class labels of the image sets in
G. For a test image set $X_t = \{x_t^i\}_{i=1}^{s_t} \in \mathbb{R}^{d \times s_t}$, the problem
of image set classification involves estimating the label
$Y_t$ of $X_t$ given the gallery G.

Training: We learn a global domain-specific DELM
model by initializing its weights using the ELM
auto-encoding procedure described earlier. This global DELM
is a multi-layer neural network with h hidden layers. Its
parameters are learned using the images in G in an
unsupervised manner. The global DELM model is represented
as $L_g = \{W_g^1, \ldots, W_g^{h+1}\}$, where $W_g^i$ is the weight matrix
of the i-th layer learned using the auto-encoding method
in Section 3.2. The global DELM model serves as a starting
point, from which we learn class-specific DELM models.

Since $L_g$ encodes a domain-specific representation (as it
has been trained to reconstruct any sample from that domain),
we use it to learn a separate DELM model for each
of the c training classes. In other words, instead of randomly
initializing the hidden layer weights, as in the conventional
ELM, we use the weights in $L_g$ to initialize the
class-specific models. Thus, we have c DELM models for c
classes $\{L_j\}_{j=1}^{c}$, where each class-specific model is represented
as $L_j = \{W_j^1, \ldots, W_j^{h+1}\}$.
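The text does not spell out how the global weights enter the class-specific training, so the following sketch shows only one plausible reading, offered as an assumption: the global weights take the place of the random hidden mapping in every layer, and the output weights are then re-solved on the samples of a single class. The helper names and the linear read-out for the last layer are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def ridge(H, T, C):
    """Regularized least squares: (H^T H + I / C) B = H^T T."""
    return np.linalg.solve(H.T @ H + np.eye(H.shape[1]) / C, H.T @ T)

def class_specific_from_global(X_class, global_weights, C):
    """Derive class-specific weights from a global model (one possible reading).

    X_class: (s_m, d) samples of one class.
    global_weights: list of h + 1 matrices, W_g^i of shape (n_{i+1}, n_i).
    """
    weights, H = [], X_class
    for W_g in global_weights[:-1]:
        Z = sigmoid(H @ W_g.T)           # global weights replace the random mapping
        B = ridge(Z, H, C)               # reconstruct this layer's input from Z
        weights.append(B)                # shape (n_hidden, n_in)
        H = sigmoid(H @ B.T)
    weights.append(ridge(H, X_class, C).T)  # assumed linear read-out to the input
    return weights

rng = np.random.default_rng(0)
d, n_h = 6, 4
# Stand-in for a trained global model; random values are used only for shapes.
global_weights = [
    rng.standard_normal((n_h, d)),
    rng.standard_normal((n_h, n_h)),
    rng.standard_normal((d, n_h)),
]
X_class = rng.random((30, d))
class_weights = class_specific_from_global(X_class, global_weights, C=1e4)
print([W.shape for W in class_weights])  # [(4, 6), (4, 4), (6, 4)]
```

Repeating this call once per class yields the c class-specific models; whether the original implementation follows exactly this scheme is not stated in the text.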

The learned DELM models are able to encode the complex
non-linear structure of the training data due to their deep
architecture with multiple non-linear layers. Compared to
the previous structure based algorithms such as DCC [14],
GGDA [5] and CDL [27], our proposed DELM models
learn the structure of the image data using multiple layers of
parameters and are therefore capable of learning more complex
non-linear manifold structures. Moreover, the DELM model is more
computationally efficient than previous methods.

Testing: Given a test image set $X_t = \{x_t^i\}_{i=1}^{s_t}$, we predict
its label by first representing each image in this set using
each of the class-specific representations $\{L_j\}_{j=1}^{c}$ and
assigning each image to the class that incurs the least
reconstruction error. Then, majority voting on the predicted
image-level classes is performed to predict the class of the
image set. The overall procedure is summarized in
Algorithm 1.

Algorithm 1 Proposed Image Set Classification Algorithm

Input:
    Gallery $G = \{X_m\}_{m=1}^{c}$ containing c image sets $X_m = \{x_m^i\}_{i=1}^{s_m} \in \mathbb{R}^{d \times s_m}$ belonging to c classes
    Class labels $Y = \{y_m\}_{m=1}^{c}$
    Probe set $X_t = \{x_t^i\}_{i=1}^{s_t} \in \mathbb{R}^{d \times s_t}$
    Number of hidden layers h
Output: Label $Y_t$ of $X_t$
Training:
    $L_g = \{W_g^1, \ldots, W_g^{h+1}\}$    {Learn a domain-specific global DELM model with h hidden layers from G}
    for j = 1 : c do
        $L_j = \{W_j^1, \ldots, W_j^{h+1}\}$    {Learn DELM models for each class}
    end for
Testing:
    for i = 1 : $s_t$ do
        for j = 1 : c do
            $\hat{x}_t^i = f(x_t^i; L_j)$    {Reconstruct from model $L_j$ (8)}
            $e^i(j) = \|x_t^i - \hat{x}_t^i\|_2^2$
        end for
        $l_t^i = \arg\min_j e^i(j)$
    end for
    $Y_t = \mathrm{mode}(\{l_t^i\}_{i=1}^{s_t})$

We reconstruct each test image $x_t^i$ in the set using each
of the class-specific models $\{L_j\}_{j=1}^{c}$. The reconstructed
sample $\hat{x}_t^i$ from a model $L_j$ is given by

$$\hat{x}_t^i = f(x_t^i, L_j) = g(W_j^{h+1} \, g(W_j^{h} \cdots g(W_j^{1} x_t^i))) \tag{8}$$

where f is the reconstruction function and g is chosen to be the
sigmoid function. The reconstruction error of sample $x_t^i$ is
computed as the squared Euclidean distance between $x_t^i$ and
$\hat{x}_t^i$, i.e. $e^i(j) = \|x_t^i - \hat{x}_t^i\|_2^2$. The predicted label $l_t^i$ for
sample $x_t^i$ is chosen to be the class that incurs the minimum
reconstruction error

$$l_t^i = \arg\min_j e^i(j). \tag{9}$$

Finally, the test image set $X_t$ is labelled using majority voting
on the set of predicted image-level labels. Formally, we
set the image set label $Y_t = \mathrm{mode}(\{l_t^i\}_{i=1}^{s_t})$.
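The testing rule (minimum reconstruction error per image followed by majority voting) can be sketched independently of how the class models are trained. In the illustration below the class models are simple orthogonal projectors standing in for trained DELM models; this stand-in, the row-wise sample layout and the names `classify_set` and `project` are assumptions made for the example.

```python
import numpy as np

def classify_set(X_t, class_models, reconstruct):
    """Label a probe set by per-image minimum reconstruction error and voting.

    X_t: (s_t, d) probe samples as rows.
    class_models: list of c models, one per class.
    reconstruct(X, model): returns the (s_t, d) reconstruction of X.
    """
    errors = np.stack(
        [np.sum((X_t - reconstruct(X_t, m)) ** 2, axis=1) for m in class_models],
        axis=1,
    )                                           # errors[i, j] = e^i(j)
    image_labels = np.argmin(errors, axis=1)    # per-image label
    votes = np.bincount(image_labels, minlength=len(class_models))
    return int(np.argmax(votes)), image_labels  # ties go to the lowest class index

def project(X, basis):
    """Toy stand-in for a class model: projection onto the span of `basis`."""
    return X @ basis @ basis.T

# Class 0 lives along the first axis, class 1 along the second axis.
models = [np.array([[1.0], [0.0], [0.0]]), np.array([[0.0], [1.0], [0.0]])]
X_t = np.array([[0.1, 1.0, 0.0],
                [0.0, 2.0, 0.1],
                [1.0, 0.2, 0.0]])
set_label, image_labels = classify_set(X_t, models, project)
print(set_label, image_labels.tolist())         # 1 [1, 1, 0]
```

Two of the three probe images are reconstructed best by the second model, so the set is assigned to class index 1 even though one image votes for class 0.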

4. Experimental Results 

We perform extensive experiments on five public
datasets (see Fig. 3) and compare results to 10 state-of-the-art
image set classification methods. These datasets
have been widely used in the literature to evaluate image
set based classification algorithms. Details of the datasets
used, experimental protocol, and results obtained are provided
next.


4.1. Dataset Specifications 

The Honda/UCSD dataset [16] contains 59 video sequences
of 20 different subjects. Similar to prior work [7,
27], we use 20 × 20 histogram equalized face images
extracted from these videos. Each video sequence corresponds
to an image set.

The CMU MoBo dataset contains 96 video sequences 
of 24 different subjects. We use LBP features of the face 
images as in [3] for image set classification. 

The YouTube Celebrities dataset [15] is a challenging dataset
that contains 1,910 video sequences of 47 celebrities (actors,
actresses and politicians), collected from YouTube.
Most videos are of low resolution and contain significant
compression artifacts. There are up to 400 frames per video.
We use the LBP features (d = 928) of 20 × 20 face images.

The Celebrity-1000 database [18] is a large-scale
unconstrained video database downloaded from YouTube and
Youku. It contains 159,726 video sequences of 1,000 subjects
covering a wide range of poses, illuminations, expressions
and image resolutions. We follow the standard closed-set
test protocol defined in [18], where four overlapping subsets
of the dataset are created with increasing complexity
containing 100, 200, 500 and 1000 subjects. Each subset is
further divided into training and test partitions with disjoint
video sequences. Approximately 70% of the sequences are
randomly selected to form the gallery and the rest are used
as probes. We use the PCA reduced LBP+Gabor features
provided by Liu et al. [18]. The feature dimension d is
1651, 1790, 1815 and 1854 for the subsets 100, 200, 500
and 1000 respectively.

The ETH-80 dataset [17] contains 8 object categories,
where each category has 10 different objects of the same
class. Each object has 41 images at different views to make
an image set. We use 20 × 20 intensity images for image set
based object categorization. ETH-80 is challenging as it
has fewer images per set, significant appearance variations
across objects of the same class and larger viewing angle
differences within each image set.

4.2. Experimental Setup 

We follow the standard experimental protocol [28, 26,
3, 7, 27, 6] for a fair comparison with 10 state-of-the-art
algorithms including Discriminant Canonical Correlation
(DCC) [14], Manifold-Manifold Distance (MMD) [28], Manifold
Discriminant Analysis (MDA) [26], Affine and Convex
Hull based Image Set Distance (AHISD, CHISD) [3],
Sparse Approximated Nearest Points (SANP) [7], Covariance
Discriminative Learning (CDL) [27], Graph Embedding
Grassmannian Discriminant Analysis (GGDA) [5], Set
to Set Distance Metric Learning (SSDML) [29] and Non-Linear
Reconstruction Models (NLRM) [6]. We use the
source codes supplied by the original authors, except for
MDA and CDL techniques. For MDA, the implementation of
Hu et al. [7] is used, while we use our own implementation of CDL.

Figure 3. Image sets from (a) Honda, (b) CMU Mobo, (c) YouTube Celebrities, (d) Exemplar video frames from the Celebrity-1000 dataset,
(e) 8 Object categories and 10 different objects in one category of the ETH-80 dataset.

Parameters of all the algorithms are selected empirically
and the best results are reported. For DCC [14],
the subspace dimension and the corresponding maximum
canonical correlations are set to 10. For MMD and MDA,
we configure the parameters as recommended by the authors
[28, 26]. The ratio between Euclidean distance and
geodesic distance is selected from the range {1.0-5.0} for
different data sets. The maximum canonical correlation is
used in defining MMD. The number of connected nearest
neighbors for computing geodesic distance in both MMD
and MDA is set to 12. For AHISD, CHISD and SANP, the
PCA energy used to represent an image set is selected from
the range {80%, 85%, 90%, 95%, 99%} and the best results
are reported for each dataset. For CHISD, we set the error
penalty parameter C = 100. For GGDA, we set = 1,
k[proj] = 100 and v = 3. The number of eigenvectors used
to represent an image set is set to 9 and 6 respectively for
Mobo and YouTube Celebrities and 10 for all other datasets.
No parameter settings are required for CDL and SSDML.
For NLRM [6], we used the network depth and model parameters
as recommended by the authors. The parameters
of our algorithm include the number of hidden layers h, the
number of neurons in each hidden layer $n_h$ and the parameter
C. We set the number of hidden layers h = 2 for
all datasets. The parameter C is in the range $10^{4}$ to $10^{8}$
for the first layer and $10^{16}$ to $10^{20}$ for the last layer. The
number of neurons in each hidden layer is 20 for Honda,
Mobo and Celebrity-1000, 40 for YouTube and 150 for ETH-80.

For the Honda and MoBo data sets, each subject has one
video sequence in the gallery and the rest in probes. For
DCC learning, at least two image sets per class are required
in the gallery. Therefore, when the gallery contained
only one image set per class, we randomly partitioned it
into two non-overlapping subsets. Experiments were repeated
for 10 folds with different gallery/probe combinations.
For the YouTube dataset, we follow the experimental protocol
of [7] and conduct five-fold cross-validation experiments.
The videos are divided to make nine image sets per subject
in each fold. In each fold, three image sets per subject are
randomly selected for training and the rest are used for testing.
For the ETH-80 dataset, each class has 5 sets in the gallery
for training and the remaining 5 sets are used as probes.
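The gallery/probe protocol described above amounts to a random per-subject split that is repeated with different seeds. The sketch below illustrates the idea with placeholder identifiers; the function name `split_gallery_probe`, the dictionary layout and the sequence names are hypothetical and do not come from the datasets' tooling.

```python
import random

def split_gallery_probe(sequences_by_subject, n_gallery, seed):
    """Randomly place n_gallery image sets per subject in the gallery.

    sequences_by_subject: dict mapping a subject id to a list of image-set ids.
    The remaining image sets of each subject become probes.
    """
    rng = random.Random(seed)
    gallery, probes = {}, {}
    for subject, seqs in sequences_by_subject.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        gallery[subject] = shuffled[:n_gallery]
        probes[subject] = shuffled[n_gallery:]
    return gallery, probes

# Placeholder data: two subjects with three and two sequences respectively.
data = {"subject_a": ["a1", "a2", "a3"], "subject_b": ["b1", "b2"]}
folds = [split_gallery_probe(data, n_gallery=1, seed=s) for s in range(10)]
gallery, probes = folds[0]
print(len(gallery["subject_a"]), len(probes["subject_a"]))  # 1 2
```

Using one gallery sequence per subject mirrors the Honda and MoBo setting, and setting `n_gallery=5` on ten sets per class would mirror the ETH-80 setting.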

4.3. Results and Analysis 

Table 1 reports the average recognition rates and standard
deviations (%) for 10-fold experiments on the Honda, Mobo
and ETH-80 datasets and 5-fold experiments on the YouTube
dataset. Our approach performs better than competing
algorithms on the YouTube Celebrities, CMU Mobo and ETH-80
datasets and achieves perfect results on the Honda dataset.
Recall that our algorithm involves no supervised discriminative
analysis as in DCC, MDA and CDL, yet it is
superior in both accuracy and execution time. On the ETH-80
dataset, structure based algorithms [14, 28, 26, 6, 27]
perform better than the sample based ones [3, 7] because
the individual samples cannot model significant intra-class
pose and object appearance variations.

Table 2 summarizes the image set classification results
on all the splits of the Celebrity-1000 dataset. On
subset-100 (Celeb-100) our method achieves a 15% improvement
in classification accuracy compared to the existing
state-of-the-art. As the feature dimension and dataset
size are huge, the training and testing times of all other methods
are very large on this dataset (for example, on Celeb-100
the NLRM [6] method took about 60 hours for training
and MMD and MDA took more than 80 hours using
a Core i7 3.4GHz CPU with 8GB RAM). In contrast, our
method takes only 5.02 seconds for training and achieves
















Table 1. Comparison of the average recognition rates and standard deviations (%). Results are obtained by performing 10-fold experiments on the Honda, MoBo and ETH-80 datasets and 5-fold experiments on the YouTube Celebrities dataset.

Method          Honda         MoBo          ETH-80        YouTube
DCC [14]        94.67±1.32    93.61±1.76    90.91±5.31    66.75±4.47
MMD [28]        94.87±1.16    93.19±1.66    85.73±8.33    65.12±4.36
MDA [26]        97.44±0.91    95.97±1.90    80.50±6.81    68.12±4.85
GGDA [5]        94.61±2.07    85.75±1.82    85.75±6.41    62.81±4.42
CDL [27]        100.0±0.00    95.83±2.07    88.20±6.80    68.96±5.29
AHISD [3]       89.74±1.85    94.58±2.57    74.76±3.31    71.92±4.55
CHISD [3]       92.31±2.12    96.52±1.18    71.00±3.93    73.17±4.69
SANP [7]        93.08±3.43    97.08±1.03    72.43±4.98    74.01±4.68
SSDML [29]      89.41±3.64    95.14±2.20    81.00±6.58    70.81±3.42
NLRM [6]        100.0±0.0     97.92±1.76    95.25±4.77    73.55±4.74
Proposed DELM   100.0±0.0     98.00±0.67    96.00±3.51    75.31±4.63


Table 2. Comparison of the classification accuracy (%) on different subsets of the Celebrity-1000 dataset.

Method          Subset-100   Subset-200   Subset-500   Subset-1000   Average
DCC [14]        25.24        10.38        10.18        -             -
MMD [28]        17.52        10.23        9.79         -             -
MDA [26]        15.93        9.21         9.87         -             -
GGDA [5]        11.95        8.24         9.64         -             -
CDL [27]        11.95        11.11        10.65        -             -
AHISD [3]       19.92        23.94        18.97        -             -
CHISD [3]       20.31        22.41        18.35        -             -
SANP [7]        20.71        21.64        19.12        -             -
SSDML [29]      18.32        17.62        9.96         -             -
NLRM [6]        34.66        31.81        27.68        -             -
MTJSR [18]      50.59        40.80        35.48        30.03         39.22
Proposed DELM   49.80        45.21        38.88        28.83         40.68

better classification accuracy than all previous methods. 
Similarly, on the subset-200 the NLRM method took about 
5 days for training and the MMD and MDA took more than 
8 days. On subset-200 DELM takes only 9.02 seconds for 
training and achieves better classification accuracy. 

Figure 4. Average accuracy of different image set classification
algorithms when the image-sets are corrupted by noise.

The subset-1000 contains 15 million frames in 1000
training image sets and 36 thousand frames in 2580 test
image sets. Previous image set classification methods
therefore have huge computational and memory requirements
on this subset, which makes their experimental evaluation
and parameter tuning very difficult and extremely time
consuming. Consequently, on the subset-1000 we only report
the results of the proposed algorithm and compare them to
Multi-Task Joint Sparse Representation (MTJSR) [18]. Note
that the accuracies of MTJSR in Table 2 are provided by the
original author [18]. The proposed algorithm has comparable
or better accuracy than MTJSR on the different subsets.
However, the testing time of MTJSR reported in [18] is very
high (3,254 seconds) on the subset-1000. In contrast, DELM
takes only 350 seconds for training and 1.7 seconds for
testing. Thus, compared to previous image set
classification algorithms, the proposed DELM based
framework is more scalable to large scale datasets.
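
The "Average" column of Table 2 and the testing-time gap on subset-1000 can be checked with a few lines of arithmetic. The sketch below copies the MTJSR and DELM rows from Table 2 and the timings from the text. The variable and function names are illustrative.

```python
# Accuracies (%) on subset-100, -200, -500 and -1000, copied from Table 2.
TABLE2 = {
    "MTJSR": [50.59, 40.80, 35.48, 30.03],
    "DELM": [49.80, 45.21, 38.88, 28.83],
}

# Testing times (seconds) on subset-1000, as quoted in the text.
MTJSR_TEST_SEC = 3254.0
DELM_TEST_SEC = 1.7


def average(values):
    """Arithmetic mean of a list of accuracies."""
    return sum(values) / len(values)


for method, accuracies in TABLE2.items():
    print(method, average(accuracies))

# Ratio of the reported testing times.
test_time_ratio = MTJSR_TEST_SEC / DELM_TEST_SEC
print(test_time_ratio)
```

The means come out at 39.225 for MTJSR and 40.68 for DELM, matching the tabulated averages, and the testing-time ratio is roughly 1,900.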

Figure 5. Robustness to the number of images per set. N_r samples
are randomly selected.

Robustness: We tested the robustness of our algorithm to
noise in a setup similar to [3, 27]. From the Honda dataset,
we generate clean data containing 100 randomly selected
images per set in both the gallery and the probes to ensure
that the same ratio of noise can be added to all sets. Image
sets are then corrupted by adding one randomly selected
image from each of the other classes. The original clean
data and the three noisy cases are referred to as N_c
(clean), N_G (only the gallery has noise), N_P (only the
probes have noise) and N_{G+P} (both gallery and probes
have noise). Figure 4 shows that the proposed algorithm is
more robust than the other methods. As expected, sample
based algorithms (AHISD, CHISD, SANP) are more sensitive to
noise than the structure based ones, since modelling the
set as a whole can resist the influence of noisy samples.
We also evaluate performance with respect to the number of
samples per set. From the YouTube Celebrities dataset, we
randomly selected N_r samples from each image set (both
training and testing) and used them for recognition. If a
set contains fewer than N_r samples, we use all of them.
Figure 5 shows the average accuracy of different methods
for three values of N_r. The proposed algorithm is more
robust and consistently outperforms the other methods as
N_r decreases.
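
The two perturbations used in the robustness tests (adding one image from every other class, and keeping only N_r random samples per set) can be sketched as follows. The function names are hypothetical and images are represented by string labels purely for illustration.

```python
import random


def corrupt_set(own_images, other_class_sets, rng):
    """Return the set plus one randomly chosen image from each other class."""
    noisy = list(own_images)
    for other in other_class_sets:
        noisy.append(rng.choice(list(other)))
    return noisy


def subsample(images, n_r, rng):
    """Randomly keep n_r images; keep all of them if the set is smaller."""
    images = list(images)
    if len(images) <= n_r:
        return images
    return rng.sample(images, n_r)


rng = random.Random(0)
image_sets = {
    "class_a": ["a0", "a1", "a2"],
    "class_b": ["b0", "b1"],
    "class_c": ["c0", "c1", "c2", "c3"],
}
others = [v for k, v in image_sets.items() if k != "class_a"]
noisy_a = corrupt_set(image_sets["class_a"], others, rng)
print(len(noisy_a))                      # 3 own images + 2 noise images
print(len(subsample(range(50), 10, rng)))
```

With 100 clean images per set and one noise image per other class, every set receives the same number of noise images, which is why the clean sets are equalized first.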

Execution Time: We compare execution times on the
YouTube Celebrities dataset. Table 3 shows the average
execution times over the 5-fold experiments using a Core i7
3.4GHz CPU with 8GB RAM running MATLAB. The proposed
algorithm is significantly faster than the existing state
of the art in both training and testing. For example, our
method takes only 1.01 seconds for training compared to
6542 seconds for NLRM, while achieving better accuracy.

Memory Requirement: We also compare the training
memory requirement of the proposed algorithm with that of
other algorithms on the YouTube Celebrities dataset. DELM
has the lowest training memory requirement (14.3MB) while
achieving better classification results than the other
image set classification algorithms (Table 3).
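
The size of these gaps can be made concrete by dividing each competitor's figure by the DELM figure. The sketch below uses only a subset of the Table 3 rows (NLRM, DCC and DELM). The dictionary and function names are illustrative.

```python
# Selected rows of Table 3 (YouTube Celebrities): seconds and megabytes.
TRAIN_SEC = {"NLRM": 6542.0, "DCC": 167.49, "DELM": 1.01}
TEST_SEC = {"NLRM": 0.54, "DCC": 8.08, "DELM": 0.06}
MEMORY_MB = {"NLRM": 523.7, "DCC": 20.8, "DELM": 14.3}


def ratio_to_delm(table):
    """Divide every competitor's value by the DELM value."""
    base = table["DELM"]
    return {name: value / base for name, value in table.items() if name != "DELM"}


train_ratio = ratio_to_delm(TRAIN_SEC)
test_ratio = ratio_to_delm(TEST_SEC)
memory_ratio = ratio_to_delm(MEMORY_MB)

for label, ratios in (("train", train_ratio), ("test", test_ratio), ("memory", memory_ratio)):
    print(label, {name: round(r, 2) for name, r in ratios.items()})
```

For these rows, NLRM training takes about 6,477 times as long as DELM training, and even NLRM, the fastest competitor at test time, is about 9 times slower than DELM.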

5. Conclusion 

We presented an algorithm for learning the non-linear
structure of image sets for efficient and accurate
classification. Our algorithm does not make any assumptions
about the underlying image-set data and is scalable to large


Table 3. Execution times (in seconds) and training memory requirements (in megabytes) on the YouTube Celebrities data. Test time is for matching one probe set to 141 gallery sets.

Algorithm       Training (sec)   Testing (sec)   Memory (MB)
DCC [14]        167.49           8.08            20.8
MMD [28]        313.57           78.32           150.2
MDA [26]        580.70           201.48          > 4 x 10^4
AHISD [3]       -                18.10           93.7
CHISD [3]       -                190.61          971.4
SANP [7]        -                17.94           160.6
CDL [27]        345.88           13.08           238.8
GGDA [5]        450.92           20.24           200.0
SSDML [29]      400.01           21.87           127.7
NLRM [6]        6542             0.54            523.7
Proposed DELM   1.01             0.06            14.3


datasets. Non-linear structure is learned with Deep Extreme
Learning Machines (DELM), which enjoy the very fast
training times of ELMs while providing deeper
representations. Moreover, DELM models can be accurately
learned from smaller image sets containing only a few
samples. Experiments on five benchmark datasets show that
our algorithm consistently outperforms 10 existing
state-of-the-art methods in terms of accuracy and execution
time.